Emergence of clones in sexual populations 
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In sexual population, recombination reshuffles genetic variation and produces novel combinations 
of existing alleles, while selection amplifies the fittest genotypes in the population. If recombi- 
nation is more rapid than selection, populations consist of a diverse mixture of many genotypes, 
as is observed in many populations. In the opposite regime, which is realized for example in the 
facultatively sexual populations that outcross in only a fraction of reproductive cycles, selection 
can amplify individual genotypes into large clones. Such clones emerge when the fitness advantage 
of some of the genotypes is large enough that they grow to a significant fraction of the popula- 
tion despite being broken down by recombination. The occurrence of this "clonal condensation" 
depends, in addition to the outcrossing rate, on the heritability of fitness. Clonal condensation 
leads to a strong genetic heterogeneity of the population which is not adequately described by 
traditional population genetics measures, such as Linkage Disequilibrium. Here we point out the 
similarity between clonal condensation and the freezing transition in the Random Energy Model 
of spin glasses. Guided by this analogy we explicitly calculate the probability, Y, that two in- 
dividuals are genetically identical as a function of the key parameters of the model. While Y 
is the analog of the spin-glass order parameter, it is also closely related to rate of coalescence 
in population genetics: Two individuals that are part of the same clone have a recent common 
ancestor. 



Genetic diversity is the fodder for natural selection and the fuel of evolution. It is generated by mutations and by 
recombination, which reshuffles genomes and thereby accelerates the exploration of the space of genotypes. The latter 
consists of all of the 2 L possible combinations of the genetic variants, a.k.a. alleles, present at L (biallelic) polymorphic 
loci. Of course the number of polymorphic loci, L, itself changes as new mutations arise forming new polymorphisms 
while "older" polymorphisms disappear from the population. The population itself consists of N individuals which 
sample only a small fraction of the possible genotypes, i.e. iV < 2 L . The dynamics in genotype space is therefore 
highly stochastic. 

Of particular importance are those genetic polymorphisms that affect the fitness of individuals, the fitness being 
defined as the expected number of offspring in the next generation. Selection on its own would amplify the number of 
high fitness individuals and condense the population into a few "clones" comprising a large fraction of the population. 
In populations of sexually reproducing organisms, the growth of such clones and the subsequent decline of genetic 
diversity are checked by recombination. Two parents, if chosen from different clones, produce offspring that are distinct 
from either parent. In obligate sexually reproducing species, the formation of clones is prevented since reproduction 
is coupled to recombination and no parent can produce genetically identical offspring. Many species, in particular 
microbial species and plants, can reproduce both by clonal reproduction (e.g., budding in yeast, selfing or vegetative 
reproduction in plants) or by sexual propagation. Such facultatively sexual species display a great variety in their 
mode of propagation, the frequency r of outcrossing, and the heritability of fitness. The latter is very important in 
sexual populations, as it determines to what extend recombinant offspring benefit from the same fitness advantages 
that made their parents successful. 

The aim of this article is to describe quantitatively the competing tendencies of natural selection and recombination 
with regard to genetic diversity, focusing on facultatively sexual organisms. The competition of natural selection with 
recombination is the dominant mechanism of evolution on relatively short time scales, on which mutational input is 
negligible compared to diversification by recombination. This situation is particularly relevant to adaptation following 
a major outcrossing event, or within a so called hybrid zone, where diverged genotypes have come together to generate 
a hybrid population. As this hybrid population continues to breed within itself, it can give rise to a bout of rapid 
adaptation, as beneficial alleles from both original populations are combined to form novel fit genotypes that spread 
within the hybrid population ( |Ortfz-Barrientos et al\ |2QQ2[ ) . 

We will focus on the probability, F, that two random individuals sampled from the population have the same 
genotype, i.e., are clones of each other. This quantity is important for population genetics, since it characterizes 
genetic diversity (its inverse is a measure of the number of dominant clones) as well as the dynamics of coalescence. 
Whenever two individuals are part of the same clone, they share a recent common ancestor, such that Y is proportional 
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to the rate of pair coalescence. In the canonical theory of neutral coalescence, this rate is equal to the inverse population 
size. We will find here that Y, and with it the rate of coalescence, is determined by the clonal structure rather than 
the population size at low outcrossing rates. 

Much of our analysis is presented in the context of a facultatively sexual population, where the evolving entities are 
individuals and their genotypes. Note, however, that some of our considerations also hold for contiguous segments 
of chromosomal DNA that are short enough to undergo only infrequent recombination even in obligatory sexual 
reproduction. In that case, we would be interested in the probability of a given chromosomal segment to be identical 
for a random pair of individuals drawn from the population. In this context, Y is the homozygosity of the population 
at this extended locus at which many different alleles segregate (this is of course a much weaker condition than clonal 
relation of whole genomes). A complementary view of this probability relates it to the "haplotype diversity", i.e., the 
number and population distribution of distinct genomic sequences for the chromosomal segment in question. We shall 
return to this important qu estion in the discussio n. 

In an earlier publication ( |Neher and Shraiman| 2009 ) , we have shown that clones are absent in the so called Quasi 
Linkage Equilibrium (QLE) "phase" corresponding to frequent outcrossing limit but appear in a regime of small 
out-crossing frequency r < r c , with r c depending on the complexity of the fitness landscape (i.e. the extent of fitness 
additivity) and weakly dependent on N. We will review this finding and present a more detailed analysis of the 
time dependence of this condensation phenomenon, as well as its quantitative dependence on fitness additivity and 
hence herit ability. Furthermore, we also study the extend of clonal condensation in a steady state where the fitness 
distribution is moving towards higher fitness at a constant velocity. We will put these results into the context of 



the Random Energy Model (REM) of Statistical Physic s, introduced and solved by Bernard Derrida (Derrida, 1981 
Gross and Mezard, 1984 Mezard and Montanari, 2009). In fact, Y is closely related to the Parisi order parameter 



and the onset of clonality is closely related to the spin-glass transition observed in simple models of disordered media 
such as the REM. 

Below, we will first draw the analogy between the dynamics of selection in finite populations and the REM. This 
analogy is particularly simple for r = 0. Whereas the condensation transition in the REM occurs below a certain 
critical temperature, the transition to clonal population structure occurs beyond a certain critical time, t > t c (N). 
Hence the population genetic analog of temperature will be the inverse time. We shall then generalize the model 
in order to include (facultative) recombination and fitness landscapes with varying degrees of epistasis, i.e., genetic 
interactions. The results for mixed epistatic and additive models enables us to set the analysis into the context of 
the "traveling wave" approximation, which has recently emerged as a powerful representation of adaptive population 



dynamics in genetically diverse populations (Cohen et a/., 2005a| Desai and Fisher| 2007 Hallatschek, 2011 Neher 



et al. 2010 Rouzine et al. t 2003 Tsimring et aZ. , 1996). Our results therefore provide insight into how recombination 



and epistasis affect the dynamics and structure of adapting population waves and define conditions under which 
genetic diversity is maintained or lost. 



I. NATURAL SELECTION AND THE RANDOM ENERGY MODEL. 

In the absence of recombination or mutation, the frequency of any individual genotype increases if its fitness 
lies above the population mean and decreases otherwise. Identifying fitness with relative growth rate and ignoring 
stochastic effects, the expected number of individuals rii(t) with genotype gi and fitness Fi obeys: 

fn(t) = (Ft - (F) t )m(t) => m(t) = e Fit ~fo *'<*V , (1) 

where hi{t) is the time derivative of rii(i) (we assume n^(0) = 1) and the mean fitness (F) t is defined by averaging 
over the whole population. Defining the rate of growth of clones by differential fitness (relative to the population 
mean) ensures constant population size: X!i n iW = N. Since the mean fitness term is shared by every genotype in 
the population, the frequency of a particular genotype is given by 



e 



where Z(t) is a time dependent normalization constant known as partition function. Hence the distribution of clones 
in the population has the Boltzmann form, where inverse time plays the role of temperature. The dynamics of the 
population will now depend on the initial distribution of fitness values Fi. In the generic case where fitness depends on 
many loci, each one giving a small contributions to fitness, the density of fitnesses p{F) = A/*(0,cr 2 ) is approximately 
Gaussi an (with zero a verage and variance a 2 ) and the statistics of clones in the populations is identical to that of the 
REM ( |Derrida| |1981 ). For small t population averages are dominated by the vicinity of the peak of p(F), with many 



individual genotypes contributing. However, for t > t C: the dominant contribution shifts all the way to the leading 
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edge 

F m ax ~ o a/2 log TV, which corresponds to the maximum fitness sampled from the Gaussian p(F) in a population 
of size TV. This means that with increasing time, i.e., decreasing "effective temperature", the population shifts to 
fitter and fitter genotypes and eventually, for t > t c , "condenses" into the fittest. This condensation phenomenon 
manifests itself in a non-negligible probability Y that two randomly chosen individuals from the population have the 
same genotype. The latter is equal to the average squared genotype frequency in the population at time t. The 
average participation ratio is obtained by averaging over the fitness values {Fi} of the N initial genotypes 
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(3) 



where we have used the integral representation of ft 2 = J °° dz ze zQ a nd the fact that the Fj are the i.i.d. random 



varia bles sampled from p(F). This calculation is carried out explicitly in (Mezard and Montanari 2009 Mezard et al. 



1985) and leads to the following quantitative result: In the limit of large populations, (Y t ) is given by 



{Yt) 



fo(N~ 



t < t C 



t > t n 



(4) 



with t c « (j~ 1 ^/2 log TV. Fig.[T]shows how (It), measured in a computer simulation (see below), increases in time and 
compares to the REM prediction. Note that the sharp transition exhibited in [4] is realized only in the limit of large 
log TV, which is hard to achieve both in reality and in a numerical simulation. In both cases one expects to find a 
crossover rather than a sharp transition. Nevertheless, considering the idealized N — >• oo limit provides a very useful 
scaffold for the analysis, and the REM also allows us to compute the 1 / log N corrections and therefore determine the 
detailed nature of the crossover. 
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FIG. 1 The participation fraction (Yt) in the absence of recombination as a function of time. For each N, the time axis 
is rescaled by t c = cr~ 1 \/2 log Na. The fitness variance a 2 = 0.0025 and is included in the expression for t c to account for 
stochastic effects. The solid black line shows the N —> oo asymptotic result predicted by the REM (Eq. Q). The convergence 
of the numerical results to this asymptotic result with increasing N is slow because it is governed by log N. Each curve is 
averaged over 1000 runs. 



Recombination and heritability 



Sexual reproduction mixes the genetic material from two parents and thereby produces new genotypes from existing 
genetic variation. To account for recombination, we modify the equation describing the evolution of genotypes as 
follows: 



h(g) = (F(g)-(F))n(g)+r 



N- 1 Y,K{g\g\g")n{g')n{g")-n{g) 



(5) 
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where n(g) is the number of individuals with genome g, and K(g\g f ,g") accounts for the probability that genotype 
g is assembled by recombination from the parental genotypes g' and g" . In the absence of recombination, the only 
relevant characteristic of a genotype is its fitness, and we could study the evolution of the population along the fitness 
coordinate instead of on the hypercube of possible genotypes. To achieve a similar simplification with recombination 
we need to know how the fitness of recombinant offspring relates to that of the parents to map Eq. (J5|. The correlation 
between parental and offspring traits is known as heritability and depends on the underlying genetic architecture. If 
a trait depends on many loci in the genome in an additive manner, i.e., different loci make independent contributions 
to the trait, the trait value of offspring will be approximately Gaussian distributed around the parental mean. Such 
traits are called highly heritable since the correlation between trait values of parents and offspring is high. Conversely, 
if a trait depends on specific combinations of alleles at many loci (epistasis), these combinations will be disrupted 
with high probability in sexual reproduction. Such traits have a low heritability in sexual reproduction since the 
correlation between parents and offspring is low. 

The trait we are mainly interested in here is fitness, which in general involves many phenotypes and depends on 
the environment. We shall set aside the issues associated with fluctuating environment and possible time-dependence 
of selection, and focus on how fitness depends on the genotype. As already noted, the genotype space is a high 
dimensional hypercube, g = (si, . . . , sl), and for present purposes, the fitness is a complicated (but fixed) function of 
the genotype parameterized as 

F(g) = fo + Yl ^ + te 3 *** + ' ' ' ' ( 6 ) 

i i<j 

where the terms define the additive contribution of locus i to fitness, while higher order terms correspond to 
epistatic interactions. The relation of the fitness of an offspring relative to that of its parents and the density of 
states, i.e., the distribution of fitness over all possible genomes, is illustrated in Fig. [2] 




FIG. 2 Heritable and non-heritable contributions to fitness. Left: The two panels illustrate heritable and no n- heritable 
fitness functions. The black Gaussian represents the density of states over all 2 L possible genotypes, while the arrows indicate 
the fitness of parents Pi ,2 sampled from this density of states. For additive fitness functions (top), the fitness distribution 
of recombinant off-spring of parents Pi ,2 (red curve) is centered around the mid-parent value, i.e., fitness is heritable. For 
completely epistatic fitness functions (bottom) , the offspring fitness is a random sample from the density of states and therefore 
not heritable. Right: Fitness is a sum of additive (heritable) and epistatic (non-heritable) components and selection amplifies 
individuals with large F = A + E. In sexual reproduction, only the additive component of fitness is heritable, so only the 
additive fitness increases in time with rate v. 



The higher the order of the interactions, the less likely it is that a particular set of loci that contributes to one parent 
is inherited uninterupted. As a consequence, the heritability of interaction terms goes down with increasing order. 
Interaction terms of high-order are essentially independent of the parents. This is reminiscent of high-order spin glass 
models, where the energies of any two configurations that differ at a macroscopic number of spins are uncorrelated. 
The large p limit of such p-spin glasses is the random energy model, where the energy of each configuration is an 



independent draw from a Gaussian distribution (Derrida, 1981 Gross and Mezard, 1984). 
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II. THE MODEL 

In general the genetic architecture of fitness is expected to be complex with additive contributions as well as epistatic 
contributions of various orders. It is very instructive, though, to consider a simplified model that includes a heritable 
component and a random epistatic component. Specifically, let us assume that fitness can be decomposed into an 
additive component A and an epistatic component E. If two individuals with fitness F\ = A\ -\-E\ and F2 = A2+ E2 
produce an offspring, its additive component, A, is drawn from a Gaussian with variance (J 2 A /2 around the parental 
mean (Ai + A2V2, while its epistatic component, E, is independent from that of the parents and is drawn from a 
Gaussian centered around with variance <j e . Hence the recombination kernel of this model depends only on the 
additive fitness of the parents 

(A-(A 1 +A 2 )/2) 2 E 2 

K(A,E\A 1 ,A 2 ) = -= e °\ -I (7) 

For much of the analysis below, we will simplify this model even further and assume that the additive fitness of 
recombinant genotypes is independent of that of the parents and simply drawn from a Gaussian with variance a\ 
centered around the population mean (A). This distribution is expected if the distribution of A in the population is 
approximately Gaussian. This simplified model is easier to analyze, while displaying the same qualitative behavior, 
as has been checked by numerical simulations. The joint distribution of A and E in the population then evolves 
according to 

(A-(A)) 2 e 2 

P(A, E) = (F - (F) - r)P(A, E) + e 2CT i (8) 

2lT(J E (J A 

The ratio of cj a and cte define the extent of fitness heritability in the process of recombination. We define "heri- 
tability" , h : which will be one of the key independent parameters of the model, as: 

h 2 =a 2 A /(a\+a 2 E ) (9) 

To illustrate the behavior of the model at different recombination rates and different heritabilities, we implemented 
it as a computer simulation. The computer program keeps track of clones with fitness Ai and E{ and population size 
rii. At each generation, the size of a clone is updated by a Poisson distributed number with mean nie Fi ~( F ^~ r + c , 
where C = (1 — N/Nq) is a density regulating term. A clone is deleted if its size is 0. In addition, at each generation, 
Nr new clones of size 1 are seeded, with additive fitness drawn from a Gaussian N((A),ga) and epistatic fitness 
drawn from A/"(0, og;). Alternatively, we can impose a "velocity" of the additive fitness by drawing its value from 
Af(vt, a a) instead of J\f((A), a a)- 

Due to the stochastic nature of reproduction, the majority of all initial genotypes will rapidly die out, even if very 
fit. Of the N initial genotypes, only a fraction ~ a remains, where a 2 = o\ + a E is a measure of the typical strength 
of selection. Similarly, of the clones produced by recombination, only a fraction <j "establishes". Those clones that do 
establish are on average larger than the deterministic expectation by a factor of a -1 . We will neglect them in most 
of the formulae below. The dominant effect of this stochasticity can be accounted for by rescaling N to crN inside 
logarithms. We will reinstantiate this correction in comparisons to simulations when necessary. This stochastic effects 
are of minor importance for the phenomena we study here since they are overshadowed by the randomness inherent 
in the fitness and seeding time of new clones. 



III. RESULTS: STRUCTURE OF ADAPTING POPULATIONS 

Depending on the rate of recombination r and the degree of fitness heritability h, the model formulated above 
exhibits very different behaviors. Before we present a formal characterization of the population structure and the 
dynamics of evolution, it is instructive to discuss the dynamics of the model as observed in simulations. 

Fig. [3] shows the distribution of additive and epistatic fitness at two different times for scenarios where additive 
(left) or epistatic (right) fitness dominates. If fitness is predominantly additive, the population adapts and moves 
towards high additive fitness as long a new genotypes are generated by recombination. The velocity is given by the 
variance of the population along the additive direction. If most of the variance is along the epistatic direction as in 
the right panel, the adaptation of the population is much slower. In addition, large fractions of the population tend 
to be condensed into a small number of clones, as we will discuss at greater length below. As the variance along the 
additive direction decreases to zero, so does the velocity. 

The population structure and dynamics at different recombination rates are illustrated in Fig. [4j The figure shows 
the composition of the population as a function of time for different recombination rates and different heritabilities. 
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FIG. 3 The distribution and dynamics of additive and epistatic fitness in the population at moderate recombination rates 
r — a. Panel A: Additive fitness dominates (a 2 A = 0.8cr 2 , a% = 0.2a 2 ). Panel B: Epistatic fitness dominates (a 2 A — 0.2cr 2 , 
a% = 0.8a 2 ). In both panels, a 2 = 0.0025 and N = 10 5 . 



In each panel, the population was initialized as a diverse sample from the density of states and was subsequently 
allowed to evolve via selection and recombination. Hence each panel shows a transient, which gives way to a steady 
state behavior. Each genotype in the population is assigned a specific color, and individuals are ordered according to 
fitness in each time slice with the fittest individuals at the bottom. 

The left column of Fig. [4] shows evolution governed by an all additive fitness function (a a = cr, &E — 0) for different 
recombination rates. In this case, the population moves steadily to higher fitness since new genotypes, fitter than 
their parents, are constantly produced. At low recombination rates, the population consists of a small number of 
clones that arise at the high fitness edge (bottom part of the panel), grow as time progresses, and shrink again once 
they fall behind in fitness (i.e., when they move towards the center of the panel). As the recombination rate is 
increased to values comparable to a, large clones cease to exist. Most of the population is made from nearly unique 
genotypes that are short lived. This pattern becomes even more pronounced at larger recombination rates. The high 
recombination limit of this dynamics can be understood in mean field theory (MFT) described in Section III. A The 
low recombination limit is described in greater detail in Section III.D| 

The right column of Fig. [4] shows the opposite limit, when fitness is not heritable but completely epistatic, corre- 
sponding to a fitness function with random high order interactions. In this case, we do not expect the population to 
move towards higher fitness indefinitely since the epistatic fitness is non-heritable. At low recombination rates, we 
expect that the fittest genotype in the population will grow while generating new recombinants distributed around 
zero fitness. With the growing genotypes, the mean epistatic fitness will increase until the selection on the fittest 
genotype, E max — (E), equals the rate at which it produces recombinants. This behavior is clearly seen in all four 
panels with h 2 = on the right. Since the recombination rate increases from top to bottom, the size at which the 
fittest genotype stabilizes decreases, while the "dust" -like fraction of the population that consists of short-lived unfit 
recombinants increases. Occasionally, recombination generates exceptionally fit genotypes, which have a chance of 
replacing the previous record holder. As time progresses, these records become rarer and rarer according to the well 
known results on record dynamics 1 . The clone structure and its dynamics for the fully epistatic case will be discussed 
in Section III.B The high recombination limit can again be understood within mean field theory; see Section [ill. A 



The center column of Fig. [4] shows the clone structure and dynamics of a case of intermediate heritability. At low 
recombination rate, the population is dominated by clones, and clones exist even when the additive population is 
only dust. However, unlike the h 2 = case, no clone dominates forever, but new clones are continuously established. 
The velocity in this regime will depend critically on how often such new clones are produced and on how much they 



advance the additive fitness. We will investigate this case below in Section III.D 



Note that a slightly different dynamics is observed if recombination of individuals with the same genotype do not produce a novel 
genotype. In this case the effective outcrossing rate of large clones goes down and the population quickly condenses into a unique clone. 
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u(E) 




E = r+(E) E 



FIG. 5 The population consists of mediocre recombinant genotypes (dust) and leading clones. A sketch of the population 
structure, indicating the smooth quasi-static distribution uj(E) at E < (E) + r and the exponentially growing clones with 
E> (E)+ r. 



A. Quasi-Linkage Equilibrium (QLE) and its breakdown 



At large r, this model admits a factorized solution P(A,E) = $(A,t)uj(E) with 



(A-a A tY 



and 



u(E) 



(E) - E ^2^1 



(10) 



derived in (Neher and Shraiman 2009 ). This solution travels toward higher additive fitness with a velocity equal to the 



variance of the additive fitness of recombinant offspring, while the epistatic fitness has steady distribution w here (E_ 
adjusts itself to normalize the distribution. This factorization is a hallmark of Quasi-Linkage Equilibrium (iKimura 



1965 Neher and Shraiman 2011a Turelli and Barton! 1994), where additive components evolve independently, while 



epistatic components are in a quasi-steady balance between selection and recombination. However, this factorized 
solution breaks down as soon as there are individuals with epistatic fitness larger than r + (E). In that case this 
solution is no longer normalizable, and additive and epistatic components cease to be independent. 

Fig. [5] illustrates this condensation behavior. The smooth distribution uj(E) is a deformed Gaussian which diverges 
at E = r + (E). Beyond r + (E), the population consists of growing clones whose size depends on when and where 
they were seeded. 

In the following we will characterize the population structure and study how the velocity in the direction of increasing 
additive fitness depends on parameters. We will begin by studying the purely epistatic case with recombination, 
which extends the REM by continuous seeding of novel recombinant genotypes. We will then study the condensation 
phenomenon in presence of additive fitness, where the population forms a traveling pulse in fitness. 



B. Clonal condensation for zero heritability 



Suppose we start with a diverse sample of size N from the distribution of epistatic fitness and inject new recombinant 
genotypes with rate Nr. After a time t, the population consists of N initial clones and a Poisson distributed number 
of new clones. A clone of age T{ and fitness F{ = Ei will have approximately size 



7li(t) 



^Ei-^n+St^.dt'Wt, 



(11) 



where the mean fitness (E) t > can be thought of as a Lagrange parameter that keeps the overall population size 
constant. To calculate (Y t (r, h)) we have to average over the fitness of the N initial clones and over the fitnesses and 
seeding times of all subsequently produced recombinant genotypes: 



(Yt(r,h)) 



I 



dz 



-Nrt 



NB (z) + ~^r kB r( z ) 



fc=l 



[\-C {z)\ N C r (z) 



\k-l 



(12) 
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where Bq(z) is the average of n 2 (t)e~ zni ^ over the E{ of initial genotypes (note that "initial" means that T{ = t). 
B r (z) is the corresponding average over recombinant clones, which in addition are averaged over their age n\ see 
Appendix |a| for derivation. Similarly, Cq(z) and C r (z) are averages of e~ zni ^ over the Ei and r^. The sum over k 
- the number of recombinants generated in time t - can be performed easily. These integrals are evaluated in the 
Appendix [B] in the limit of large N. One obtains an approximation 



<r,(r,0))^exp 



-Ne 



-V2 logiV[l- 



valid for r ~ r c . with 



(7^2 log AT. 



(13) 



(14) 



We observe that the participation ratio substantially deviates from zero only for r < r c and only upon reaching 
t > t c (r), where for r « r c 



tc(r) 



| log AT 
1 - 



*c(0) 



(15) 



This provides us with estimates of the critical time of condensation t c (r) and the critical r c below which condensation 
occurs. Condensation is delayed compared to the case of r = and the critical time (analogous to inverse critical 
temperature) diverges as the recombination rate approaches its critical value r c . Of course, this divergence only occurs 
in the limit of \/\og N going to infinity. In practice, with realistic population sizes A", one observes not a transition, 
but merely a crossover between different regimes of behavior. More elaborate expressions for t c and r c at finite N can 
be found in Appendix [B] Note that the values of t c and r c themselves scale with the square root of the logarithm of 
A". 

(lt(r, 0)) is shown in Fig.|6]A for different recombination rates. With increasing recombination rate, the early plateau 
of (Y t (r, 0)) is reduced and condensation is delayed. Each individual run condenses rapidly once the dominant clone is 
large enough that it often recombines with itself, which leaves the genotype intact. Also, the lack of condensation for 
r > r c is clearly seen. Fig. [6^3 shows the inverse time to condensation for different recombinati on rates and population 
sizes, confirming Eq. (15) for r ~ r c . This transition with increasing r was identified in ( |Neher and Shraiman 2009). 



Intuitively, the divergence of the time to condensation is due to the fact that the growth rate of best genotypes 
decreases with increasing recombination load r. As soon as E — r is smaller than zero for all existing genotypes, 
all clones are short-lived and no condensation c an occur. Similar behavior has bee n observed in populations with 
heterozygote advantage or disadvantage ([Barton 



1983 Franklin and Lewontin 



1970) 



Populatio n dynamics for r < r c , including t > t c , has the nature of a "records process" (Krug and Jain, 2005 |Sire 



et al.\ 2006). As time progresses, more and more genotypes are sampled from the density of states and tested by 



selection. As a consequence, the population will come across fitter and fitter genotypes, resulting in a record process 
with Nrt trials. Even if initially no genotype with E — r > is present, such a genotype will eventually be found and 
result in condensation. On the other hand, any finite population will eventually reach a final "record" (with fitness 
Ef), giving rise to the clone that will eventually take over the whole population. This is because a clone rapidly 
fixates once it is large enough to frequently recombine with itself. To prevent its fixation a new record would have 
to be created with the fitness advantage AE > r(Ef — r) 2 /logN within the time delay of r, which is very unlikely 
beyond certain Ef. 



C. Traveling solutions for additive fitness 

In the opposite limit when h 2 = 1, the population moves towards higher fitness with a velocity that is given by the 
additive variance v = o\ for sufficiently large r. It will be useful to parameterize the ratio between the velocity v and 
the scale of selection a 2 as 7 = v/a 2 . At high recombination rates, we have 7 = /z 2 , while we generally expect 7 < h 2 
at low recombination rates. In contrast to the case h 2 = 0, no aging dynamics is observed for h 2 > 0. Instead, old 
genotypes are constantly replaced, and dominant genotypes in the population have a finite characteristic age. Hence 
the initial genotypes are rapidly forgotten and we can restrict the analysis to recombinant genotypes. 

Genotypes seeded at the high fitness nose of the population distribution can grow large and dominate the partic- 
ipation ratio (Y). Since (Y) is closely related to the rate of coalescence, it is instructive to calculate (Y) explicitly 
assuming v = a 2 before considering the case v < a 2 at low r or h 2 < 0. 

In the steady state, a clone is specified by its age tj and its initial fitness Aj. Assuming v = a 2 , we then find 



(16) 
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FIG. 6 Panel A shows (Yt(r, 0)) for different ratios of r/a for fully epistatic fitness functions as a function of time relative to the 
critical time with r — 0. The curves shown are averages over many runs which exhibit substantial stochasticity. Recombination 
delays condensation and reduces (Y t (r, 0)) early on. The model used to produce the data in this plot accounts for the fact 
that recombination of identical parents does not produce a novel genotype, which results in rapid condensation once one clone 
becomes macroscopic. Parameters N — 16000, a 2 = 0.005, and h 2 — 0. Panel B shows how the condensation time t c (r) diverges 
as r approaches r c ~ E m ax ~ a\/2 log Na. For times and recombination rates below the lines, the population is condensed into 
clones; above the lines, no genotype is populated by a macroscopic fraction of the population. Here, t c (r) is defined empirically 
by (Yt c (r)) = 0.1. The solid black line indicates the prediction of Eq. (15) for r ~ r c . 



To evaluate (Y(r, 1)), we have to evaluate the integrals B r (z) and C r (z) that appear in Eq. (12) and are defined in 
the Appendix [Aj see Eq. ( A1Q| ). The integrals involve averaging over the initial fitness Ai and all possible ages Xj. 
For h 2 = 7 = 1, one finds CT/\z) « z because relatively young and small clones dominate. In contrast, B r (z) has a 
significant contribution from rare old clones, which dominate because of their large size (through the nf factor). The 
evaluation of the integrals is detailed in Appendix [C] with the result Eq. (C5). We obtain 



B r (z) 



a - 



T(- 



a^2 log; 



(17) 



The first term is of order 1 and corresponds to the contribution of young clones. Its exact value depends on the details 
of the stochastic dynamics, such as the offspring distribution. The second term is the contribution from old large 
clones. The participation ratio therefore becomes 



(Y(r,l)) « N [ dzB r (z)e 
Jo 



-Nz 



-^V21ogA/a- 



r/a >{y/2- 1)^2 log No 
r/a <(V2- l)^/2\ogNa . 



(18) 



For small r, (Y(r, 1)) does not scale as N -1 . In other words, the larger the population is, the larger are the clones 
it is composed of and those largest clones dominate (Y). This result is in agreement with arguments made for rapid 
coalescence in facultatively sexual populations in (Ne her and Shraiman| 2011b Rouzine and Coffin 2007, 2010). 
Fig. [7^ shows (NY(r)} for as a function of r/^2a 2 log Na. As soon as r/^/2a 2 log Na < 0.4, Y(r) increases and no 
longer scales with TV, as predicted by the mean field calculation above. The figure also shows the explicit expression 
in Eq. (p~8| ) as dashed lines. Note, however, that the calculation of (Y(r, 1)) involved several approximations where 
^J2 log N was assumed large. As a result, the accuracy of the prefactor is low, and we have dropped all non-exponential 
parts, while enforcing (Y(r, 1)) = l/N at the cross-over. 

The derivation above is valid for ra -1 y/2 log Na > 1. For smaller r, the velocity drops below the high recombination 
limit, 7 = 1, and the fitness distribution in the population stops being Gaussian. In this regime, the velocity has to 
be calculated self-consistently by matching the rat e of production of new extremely fit clones to the overall velocity of 
the population. This problem has been st udied by Rouzine and Coffin (2005, 2010) in the context of HIV evolution. 

A simplified version of the arguments of Rouzine and Coffin (2005) is given in Appendix [d] With 7 = v/0 2 , we find 



1-7 



log[fV2(T^)] 



log[fV2(l- 7 )]+Iog[ 7 - 1 rJV] 



(19) 
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FIG. 7 Panel A: The rescaled average participation fraction (NY(r)) as a function of r/^/2a 2 log Na. For large r, we find 
N(Y(r)) ~ 1, while N(Y(r)) ^> 1 for small r. The dashed lines indicate the prediction by Eq. (18), which is expected to 
be valid for r < (y/2 - l)yj2a 2 logiV. The theory curves are scaled such that Y 
reduction of velocity observed in simulations against the numerical solution of Eq. 



at the crossover. 
(19) where 7 = v/a 2 . 



Panel B plots the 



iteratively. The result is 
in the model used. The 



This implicit equation for 7 can either be solved numerically or, in the case of very small r, 
similar to that obtained by (Rouzine and Coffin, 2005); differences are due to the difference 

numerical solution is compared to simulation data in Fig. [7^3. Convergence with increasing N is slow and Eq. (19) 
has a solution only for small r, for which we have little data. 



The velocity of the traveling wave falls below a 2 as soon as a few large clones dominate 
case, not only B r (z,h) but also C r (z, ft) are dominated by those larg e clones, and (Y"(r, ft)) 
show i n the appendix that (Y(r, ft)) ~ 1 — 7 is of order 1 (see Eq. (C7)). This is in agreement 
(2005), who found that the typical number of large clones is ~ logrA^a ~ Y(r, ft) -1 . 



the population. In this 
behaves differently. We 
with Rouzine and Coffin 



D. Traveling wave solutions at intermediate heritabilities 

In the high recombination limit, the population moves towards higher fitness with velocity v = o\ regardless of any 
epistatic fitness contributions. However, we have seen above that a clonal structure can emerge even in the absence 
of epistasis if recombination rates are low enough. Intuition and the simulation in Fig. [4] show us that this clonal 
structure is more pronounced in the presence of epistasis and persists at high recombination rates. 

The reason for this behavior is the fact that the fitness variance of the recombinant offspring is larger than the 
velocity v = o\ < a 2 . In this case, fit genotype grow above the traveling Gaussian envelope and generate macroscopic 
clones. 

Fig. [4] shows that, at low recombination rate and herit ability, the population is always dominated by a few large 
clones with high non-heritable fitness components, which produce a large number of (on average) non-fit offspring. 
Rarely, such offspring are very fit, and replace their predecessors. The probability that offspring are fitter than their 
parents, and with it the velocity, increases dramatically with ft 2 and r. Conversely, the size of the fit clones and the 
participation fraction decreases with ft 2 and r. Simulation results for Y and v in the steady state are shown in Fig. [8] 
for different values of ft 2 and r. In the following, we will rationalize and quantify the observed behavior. To begin 
with, we will assume a constant velocity and calculate Y. Again, these calculations are detailed in Appendix |C| 

After successful establishment, a clone will grow approximately deterministically according to 

71 • 

-±=F j -r-vr j -{E) (20) 
rij 

where Fj = Aj + Ej is the sum of the additive fitness Aj, measured relative to vt when the clone was born, and Ej 
is the epistatic fitness. The advancing mean additive fitness is accounted for by vtj, while (E) is the mean epistatic 
fitness. The size of the clone peaks at age r* = (Fj — r)/v (where we have defined r = r + (E)) and then decreases 

until the clone disappears. The maximal size of the clone is nj 1 ^ ~ e -( F J - r ) 2 / 2v . Since the fittest genotypes in the 
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population have a fitness y/2 log N above the mean, they grow larger than N if 

r < r* = a{\ - ^)^2logNa , (21) 

which gives us a first indication of the breakdown of the mean field solut ion, where v = a\. We confirm this again 
by calculating the integrals C r (z,v) and B r (z,v) in Appendix |C| see Eq. ( |C4| ). 

Furthermore, we calculate (Y(r, h 2 )) in Appendix |C| and find that (Y(r,h )) starts to be larger than TV -1 as soon 

as 

r<r c = a(V2 - ^^logNa , (22) 

These two conditions on f define two critical recombination rates r c and r* at which different features of the mean 
field solution break down. Note, however, that this expression is not valid close to v = 0, since we have assumed a 
traveling wave with clones of finite lifetime. 




Recombination rate Recombination rate Recombination rate 



FIG. 8 Clonal condensation as a function of recombination rate and heritability. Panel A: Ratio of the traveling wave velocity 
to the total variance. At low heritability and low recombination rates, the average velocity of the traveling wave is much 
lower than the additive variance. The equality of velocity and additive variance promised by Fisher's Fundamental Theorem 
is recovered in the high recombination limit. Panel B shows the corresponding participation ratio log(Yoo(^, h)). Low velocity 
goes along with high {Yoo(r,h)). Panel C shows the coefficient of variation of the time trace Yt, i.e., std(Yt)/mean(Y£). Y t 
fluctuates strongly at intermediate recombination rates between the two critical lines identified in the main text and indicated 
by the white lines. In all panels, N = 10 6 and a 2 — 0.0025. The additive fitness is drawn from a Gaussian centered around the 
current mean additive fitness. 



Both of these two lines seem to play an important role for the behavior of the population. Between the two 
lines, we observe a coexistence between condensed populations and non-condensed populations, which results in large 
fluctuations of Y(t). An example trajectory of Y(t) is shown in Fig. [9] For a limited amount of time, the population 
is condensed with large Y and doesn't move in the additive direction. It then becomes unstuck and adapts for a while 
before getting stuck again. The time average of Y is dominated by these intermittent condensates in this meta-stable 
region and is larger than A/ --1 as soon as r < r c = <j{\[2 — ^7)^2 log N. Good simulation evidence for the different 
transition lines, however, is difficult to obtain because of substantial sub-leading corrections. 

To investigate the model in absence of this stick-slip behavior, we simulated a variant of the model where additive 
fitness is drawn from a distribution that steadily moves towards higher fitness with velocity v, as is assumed in the 
calculat ions . The time averaged (Y(r,h)) is shown in Fig. [9^3 and compared to the predicted onset of condensation 
by Eq. ([C7j (black lines). 



IV. DISCUSSION 
A. Summary 

Correlations between different parts of the genome are usually referred to as linkage disequilibrium, suggesting 
that due to genetic linkage, i.e., a high probability of coinheritance, the allele frequencies at different loci are not 
independent. Here, we have used a different measure to characterize correlations in the population. Instead of looking 
at correlations between individual loci, we have characterized the distribution of clone sizes, or equivalently haplotype 
frequencies, in adapting populations. Our analysis is not restricted to additive fitness functions, but we have also 
analyzed a simple model where fitness is partly heritable and partly random. 
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FIG. 9 Panel A shows a time trace of Yt(r,h) which exhibits strong fluctuations. For part of the time, the population is 
dominated by a single clone and does not adapt. After a while, this clone is superseded and the population starts moving 
again. Panel B: Participation ratio for different heritabilities and different population sizes at an imposed velocity of v = o\ 
(the transition is sharper if v is adju sted self-consistently). Note that above the critical recombination rate, (Y) is of order 
iV _1 . The analytical predictions (Eq. (C7)), up to the prefactor which was fixed to ensure Y(r c , h 2 ) — iV _1 , are shown as black 



lines. Dotted lines: N — 10 , dashed: N — 10 , solid: N — 10 , a 2 



0.0025. 



While macroscopic condensation, (Yqo) ~ ^(1)? se ^ s m on ly f° r r < (J / \fT\ogNcr in the absence of epistasis, 
we observe condensation of the population at recombination rates of order a in presence of epistasis. Here, a is 
the standard deviation in fitness and sets the strength of selection. The reason for this behavior is the fact that 
the velocity at which the population adapts is smaller than the fitness variance of the recombinant offspring. The 
additional epistatic variance allows the seeding of new clones far ahead of the population mean, which is only slowly 
catching up. Hence fit clones can out-grow any traveling Gaussian. At the same time, condensed clones cause 
the average epistatic fitness to be significantly greater than 0. Since this epistatic fitness is lost upon outcrossing, 
the population has a tendency to partition into a few fit clones with high epistatic fitness and a large number of 
recombinant genotypes with random epistatic fitness. This co-existence between "condensed" and "dust" phases is 
seen in Fig. [4] in the panels on the right. As long as the heritability, i.e., the fraction of additive variance, is larger 
than zero, the population seeds new clones even at low recombination rates and the rate of coalescence will be given 
by (Y) times the characteristic turn-over rate of clones. 

In the absence of any additive variance, the observed behavior is quite different. In this case, the fitness function is 



completely random (a.k.a. House-of-cards model (Kingman 1978), or random epistasis/energy model (Derrida, 1981 
|Neher and Shraiman , 2009[ )). The only way the population adapts is by sampling fitter and fitter individuals from 
the same distribution. In other words, the population dynamics amounts to a record process where the total number 
of samples taken increases as N(l + rt). Records establish and grow with the rate E — (E) — r. One additional 
complication that is of particular importance in the case h 2 = is the fact that whenever a clone recombines with 
itself, it does not generate a novel genotype. This has the tendency to shut off recombination and stabilize clones as 
soon as they grow large, resulting in a rapid loss of genetic diversity. In a previous publication ( |Neher and Shraiman| 
2009), some of us have studied the onset of condensation in a more descriptive manner. Here, we have extended 



this work by explicitly calculating Y, both during an initial transient as well as in a steady state where variance is 
maintained. 

The model we have used is extremely simplistic, and one might wonder about its relation to real world popula- 
tions. Nevertheless, it accounts for a number of features of real populations such as heritabilities between and 1, 
outcrossing, and mimics a large number of loci in the sense of quantitative genetics. These features give rise to quali- 
tatively different dynamical regimes, which will also be observable in more realistic models. Some facultatively sexual 
populations are in fact remarkably close to this simple "E and A" model. Many plants and microbial populations 
are facultative outcrossers. In the event of outcrossing, a large number of crossovers on many chromosomes produces 
many independently inherited genomic segments. HIV, for example, recombines via template switching following 



coinfection at an outcrossing rate of a few percent (Batorsky et al. 2011 |Neher and Leitner , 2010). In each of these 



outcrossing events, roughly 10 crossovers are observed ( |Levy et al 2004[ ). If populations are polymorphic at many 
loci, the resulting off-spring distribution is approximately described by Eq. (|7|). We have made a further simplifica- 
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tion by assuming that the fitness of a recombinant offspring is independent of its parents and simply drawn from a 
comoving distribution. This assumption is expected to approximate recombination process es where the offspr ing and 
parent fitness decorrelate rapidly over a few out-crossing events, as for example in Eq. ^ ( |Neher et al. 2Q1Q| ). Note, 
however, that loci close together on the chromosome decorrelate only slowly. 

The other dramatic simplification made above was the partitioning of fitness into additive and random components 
corresponding to high order epistatic interactions. More generally we expect to find epistatic interactions of various 
orders, which are heritable to different extents. However, the "E and A" model should not be thought of as a 
parameterization of the genotype-fitness map, but rather as a pa rtitioning int o variance that can be explained by the 
best fitting additive model, and the remaining variance (Fisher, 1930 Neher and Shraiman, 2011a). The best-fitting 



additive model is in general time dependent, and the heritability can change as the allele frequencies change (Turelli 
|and Barton| |2006| ). We know rather little about the genetic architecture of fitness, which justifies the the use of such 
simple models. In specific scenarios where the genotype-phenotype map is known, more detailed modeling should be 
guided by the general conclusions drawn from the "E and A v model. 

The variation that we assume is always present among the recombinant off-spring, it is ultimately fueled by mu- 
tations. The balance between the influx of beneficial mutations and selection in facultatively sexual population has 



been investigated in (Cohen et al.\ 2005b |Neher et al. , |2010|). Even in sexual populations, the dynamics of mutation 



can be strongly affected by the selection through the clonal structure of the population. 



B. Why is Y important? 



The participation ratio, Y, is exactly the probability for two individuals to be genetically identical. Therefore, 
it immediately gives a measure of the clonal structure of the population. If the two genotypes are identical, they 
have had a common ancestor in the recent past. Hence, (Y) is proportional to the rate of coalescence, and as 
soon as (Y) is no longer proportional to TV -1 , coalescence is greatly accelerated relative to a neutral model. It is well 
known that selection accelerates coalescence since fit individuals have more offspring and dominate future generations. 
Recombination tends to reduce the effects of linked selection since it decouples different regions of the genome. We 
have calculated the rate of coalescence in our model, which is set by a balance between selection and recombination. 
We have shown that there is a critical recombination rate, where recombination is overwhelmed by selection and the 
population structure changes qualitatively. 

In the case of additive fitness functions, we have found that (Y) ~ exp 



with earlier work (Neher and Shraiman, 2011b Rouzine and 



JV2 logTVcr 



which is in agreement 



2010). In this case, the population consists of 



clones apparent as little "bubbles" in the representation of Fig. 141 Any such bubble originates from a common 
ancestor ~ cr~ 1 \/2 log Ncr generations in the past, and (Y) is the probability that two individuals belong to the same 
"bubble". This bubble-coalescent is similar to ideas developed for structured populations (Nordborg 1997) or the 



fitness class coalescent ( jWalczak et a/.||2012| ). Note, however, that the clone-size d istribution is ver y long tailed and 



the bubble coalescent is no t in the universality class of the Kingman coalescent (|Kingman[ 
Bolthausen-Sznitman type (iBrunet et al. 



20071 Mohle and Sagitov, 2001 Neher and Shraiman, 2011b) 



1982), but possibly of 



Genetic identity between some of the individuals reduces the effect of outcrossing, since identical parents produce 
identical offspring. Since the probability of such an occurrence is equal to Y, the effective rate of recombination in 
the partly clonal population is r e ff = r(l — Y(r e ff, h)). Hence, strictly speaking our calculations of Y(r, h) deep in 
the clonal condensation regime must be taken through a self-consistency step, which replaces r that was hereto an 
independent variable by a dependent variable r e ff. This however would not change our estimates for r c (h) and t c (r, /i), 
which are defined by the point of first emergence of clones (when Y rises above 0(1/N)). Going beyond the MFT 
description of recombination, one may define a participation ratio density T(F) in terms of which Y = j dFT(F) 
which picks up the fitness dependence of Y: individuals with relatively high fitness are much more likely to be clonally 
related. 

The significance of Y is not limited to the case of exact genetic identity within clones. In particular, mutations 
would introduce additional polymorphic loci into the "clones" that were the focus of our study. Yet the genetic 
structure of the population introduced by clonal condensation still survives: one only needs to distinguish between 
high-frequency polymorphisms, which are being reshuffled by recombination as approximated by our model and the 
low frequency new polymorphisms due to recent (on the time scale of clonal growth) mutations. The latter would 
appear on the background that is clonal with respect to the high-frequency alleles. Thus the "clones" emerging at 
low r should be thought of as haplotypes ( |McVean and Cardin| |2005[ ) and the "clonal condensation" is the process 
that suppresses the number of haplotypes on small length scales. 

The participation ratio can be readily generalized to allow for a degree of genetic distance within a pair of individu- 
als. As currently defined Y = (S(\\g — g'\\)) (where \\g — g'\ \ stands for the genetic distance betwee n g and g'). This is 
immediately recognizable as a special case of the Parisi order parameter q(x) = (S(\ \g — g'\ \ — x)) (Me zard and Mom| 



HI 



tanari 



2009). The latter therefore provides an interesting representation of the haplotypic structure of populations. 



It would be interesting to understand whether more realistic models of fitness landscapes (with low order, rather 
than random epistasis) would generate more complex hierarchical structure of q(x) than the simple "dust/clone" 
dichotom y found in our simply model. Th e relation between the REM and Sherrington-Kirkpatrick models of spin 
glasses ( [Sherrington and Kirkpatrick 1975| ) could pro vide useful guidance and ult i mately yield better understanding of 



haplotype d istributions and recombinant coalescents (| Hudson and Kaplan| |1995| |Myers and Griffiths) |2003 Stephens 
etal] 120011). 



C. Future Directions 



The analysis of "clonal condensation" presented above can and should be extended in several ways. Within the 
confines of the model considered, one may want to obtain a better understanding of the "mixed phase" lying between 
the two transition lines identified in Fig. [8] This phase is characterized by large fluctuations in clone size distribution, 
and hence in Y, even in the approximation imposing a fixed traveling wave velocity v. In reality, the population sets 
its own instantaneous rate of change of average fitness, which depends sensitively on the fitness of the leading clones 
and changes with time as new fitness "records" are established by fresh recombinants. We have already described in 
Fig. [9] the "stick-slip" dynamics, which is characteristic of the mixed phase regime. (Needless to say, the existence 
of the mixed phase region corresponds to the 1st order nature of the clonal condensation transition for h ^ 0,1. 
) A fully quantitative description of this behavior would require going beyond MFT. So far, attempts to describe 



fluctuations in the dynamics of adaptation have been few and far apart (Brunet et al 2007 Hallatschek 2011 Neher 



and Shraiman, 2012): a quantitative description of the "stick-slip" progress of adaptation would represent a major 



step forward. 

Another necessary extension of the model involves mutational influx. A non-zero mutation rate would provide an 
influx of genetic variation and define true statistically stationary states corresponding to adaptive traveling waves 



(Cohen et al 2005a Desai and Fisher, 2007 Neher et al , 2010 Rouzine et al , 2003 Tsimring etal] 1996) or, in the 



presence of both deleterious and beneficial mutations, to a dynamic mutation selection balance (Goyal et al 2011). 
In that case, emergent clones become "fuzzy" as they accumulate mutations, and the participation ratio should 
be replaced by a more general Parisi order parameter as explained above. The result should provide interesting 
quantitative insight into the expected genetic structure of facultatively sexual populations. 

Perhaps the most interesting and important extension of the present work would involve a generalization of the 
model and its analysis to linear genomes and obligatory sexual populations. In contrast to the current model where 
recombination freely re-assorts the loci, a more realistic linear chromosome model would describe recombination in 
terms of relatively infrequent crossover. This would naturally tie the frequency of recombination to the length of the 
segment considered. We expect that on sufficiently short scales, where recombination is infrequent, a tendency for 
haplotype condensation would be manifest if the population is diverse enough. Whether epistasis plays a sign ificant 
role in the condensat ion process will depend on the distribution of epistatic interactions along the genome (|Neher| 
and Shraiman| [201 la|). If there is a strong tendency of mutations to interact with mutations nearby (iCallahan et al' 



2011), the heritability decreases as smaller and smaller segments are considered. This could result in condensation of 



mutations into "super-alleles" . However, the embedding of the haplotype in ques tion into a larger ge nome gives rise 
to co mplications related to Hill- Robertson interference ( |Barton||l994||Hiil and Robertson [l966||Hudson and Kaplan| 
1995). Transient associations with other genomic regions will either boost (the hitch-hiking process) or suppress 
(background selection) the spread of a haplotype in the population. This re duces the efficacy of selection and gives 
rise to a stochastic dynamics with rather different properties than genetic drift ( Neher and Shraiman||20lTb ). Bridging 
the different scales and resulting dynamical regimes represent an important challenge for the future. 



D. Conclusion. 

In conclusion, we stress the distinction between the QLE and "clonal condensation" regimes. In the QLE regime, 
recombination is sufficiently rapid to overwhelm any clonal amplification due to selection, and correlations between 
alleles at different loci are relatively weak. In this regime, the correlations between loci are well described by linkage 
disequilibrium, which measures population averaged pair correlation of loci. By contrast, clonal condensation is a 
non-perturbative, large deviation from linkage equilibrium (under which loci are completely uncorrelated) , which in 
particular results in a stratification of the population depending on its fitness: clones appear predominantly in the 
upper reaches of the fitness distribution. Strong heterogeneity among individuals along the fitness axis is not captured 
by measuring linkage disequilibrium and other traditional approaches. Understanding strong interactions in multi- 
locus systems requires new ideas and tools. We have found simple models such as the REM to be a very useful source 
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of insight into these non-trivial aspects of population genetics. 
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Appendix A: Participation ratio for Clonal Condensation. 

Following its definition for the REM we define the participation ratio for a set of clones as the sum of squared 
frequencies 

<*> = <E(^rk) W=E rdzz(nUt)e-^l[e-^), (Al) 

where 

nFuTi{t) = e ^-^Sl Ti ^(F) T , (A2) 

describes the size of clone i created a time T{ in the past with fitness F{. Due to stochastic effects, most clones 
go extinct early on, and the fraction a that remains is on average larger by a factor of a -1 . We will ignore this 
correction and reinstantiate it later when comparing to simulations. The rate of growth of the clone is controlled by 
the "replication rate": the fitness relative to the time dependent average fitness of the population (F)(t) minus the 
recombination rate. The average over clones (...) implies averaging over Fi and T{. It can be decomposed into the 
average over the initial population (of individuals present at t = with n = t ) and the average over subsequent 
recombinants that are generated via Poisson process with rate Nr. Hence 



U>| N oo Nrt ( ft r } k 

dFp(F)e- znF ^ t) ^ x fcj— [ Nr J dr J dF P( F ) e ~ znF ' At) ] (A3) 



(A4) 



= jl - J dFp(F) (l - e ~ znF ^ J x exp ~Nr dr J dFp(F) (l - e ~ znF ^ 

' (A5) 
« exp[-NC (z,t)-NC r (z,t)] , (A6) 

where we have defined 

Co CM) = J dFp(F) [l - e - znF ^ , (A7) 
which expresses the average over the initial population and 

C r (z,t) = r J^dr J dFp(F) [l - e -* n ^ (£) ] (A8) 
which gives the contribution of all recombinants. If we further define 

B (z,t) = J dFp(F)n 2 F)t (t)e- zn ^ = -^C (z,t) , (A9) 

B r (z,t) = rj^drj dFp(F)n 2 F)T (t)e- zn ^ = -^C r (z,t) , (A10) 
we arrive at the following expression for the participation fraction: 

P oo 

(Y t ) = N dz z[B (z,t) + B r (z,t)]e- N ^ c °^ +C ^ z ' t)) (All) 
Jo 
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At sufficiently long times, rt ^> 1, the population is dominated by new recombinants so that Bq and Co may be 
neglected in comparison with B r and C r . 



The growth and decay of clones, Eq. (A2), depends critically on the dynamics of the mean fitness, which in turn 
depends on the heritability of fitness. We have discussed the different cases (h 2 = 0, 1 and intermediate heritabilites) 
in the main text and will present the calculations pertaining to them in separate appendices below. To streamline 
the calculation, we will rescale all rates with cr (r — >• r/cr, A — »> A/a, E — >> E/cr, F —> F/a) and multiply times with cr 
(t — y tcr ^ t — y tct). Consequently, v now equals 7. We will reinstantiate cr in the final results to facilitate comparison 
with formulae in the main text. In addition, the following integrals will prove useful: 



J dxe ax+2bx - eb * 



a b 



< - ir,2 



'(l-e-«) = 

40 



) 



: r f(l_ e -«) 



£ * e 



(A12) 
(A13) 



where the latter holds only for a < b. 



Appendix B: Participation ratio for the fully epistatic h = case 

The fully epistatic case with recombination or mutation is the closest relative of the REM in population genetics, 
known in the population genetics literature as the "house-of-cards" model (Kingman 1978). In this model, every 
new genotype is drawn from the same distribution which we will take to be the standard normal distribution, that is, 
fitness is non- heritable. 

Let us consider rt ^> 1 so that enough recombinants are produced to dominate over the initial set of genotypes 
and evaluate C r {z,t) and B r (z,t). Furthermore, let us assume that the mean fitness is constant (we will revisit this 
assumption later). The size of a clone with age r is then simply n = ex.p[(E — f)r], and C r (z,i) is given by 

C rA z) « r /\fr r ^Le-^[l-e-^] . (Bl) 

JO J-00 V^TT 

Our strategy in approximating this integral will be to identify the region where $ = ze^ E ~ r ^ T < 1 and expand the 
exponential in that region; outside that region we shall neglect the exponential compared to 1 (in the square 
bracket). Defining = E — f, we have 

-j^e—^ \ dr[l-e-^} (B2) 

-00 v\J lit Jo 



• / — =e 2 \e m -l } + zr / — =e / dre T 

J-ooOV2^ Jo 0V2^ Jo 



(B3) 



(IS (0 + r) 2 

e 2 



t~ 1 \ogz- 1 

£ _1 log z 



te 



dr[l - e~ zeT ] (B4) 







_eI 

rte 2 



(E*-r)E 2 V2^ : 

where E* = t~ 1 \ogz~ 1 + f and we have along the way assumed that t > E*(t), which is realized for t > t* 



(B5) 
(B6) 



|[1 + Y 1 + ^ logz 1 ]. As we shall see below, the dominant contribution to Y comes from the region z 1 ~ A/", so 

that in the low r regime (f < >J2 log A") we have t* ~ ^/\ogN. At the end of the calculation we shall check that t > t* 
condition is satisfied. 

At short times the second term in our expression for C r j(z) is clearly smaller than the first, so that the dz integration 
in the expression for Y is controlled by the e~ Nz factor, which set the scale for z at z ~ A --1 . Let us next estimate 
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the time for which the second term in the expression for C r , t (z) becomes larger than the first provided that z ~ TV 1 . 
We obtain 



(t- 1 log N + ry 

e 2 < 



rt 2 N 



(t- 1 log TV + f) 2 log 7VV2tt 
Let us define t = t/ \ log TV and f = r/v/2 log TV; then 



(B7) 



A/JlogTV^- 1 +f) 2 V27r 



(B8) 



For f > r c = 1 this can only be satisfied if f 2 > t\ ~ A/" 7 * 2 1 >/log TV, which diverges in the large TV limit. Below r c , 
but still close to it, we can approximate the crossover t* ^> f _1 , and the crossover condition becomes 



1 — f 
2f 



2 loo 



4fV7r log A/" 



2f log TV 



(B9) 



which is an approximation valid for r ~ r c = >/2 log TV. 

It remains to calculate the participation ratio as a function of t asymptotically for t^> t c and for t « t c . To do so, 
we need to evaluate the 5-integral and perform dz integration in the integral representation for Y(t). The 5-integral 
can be obtained by differentiating C r (z,v) twice with respect to z, which evaluates approximately to 



B(z,t) 



-2 d 2 

dE'i 
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rte 2 
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(E*-r)E*V2K 



(BIO) 



Hence 
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(Bll) 



(B12) 



(B13) 



i(l) « x/^giVA 



which upon substitution into Eq. B12 yields the final result 



log crt 
log TV 



(y> « exp 



-TVcre" V21ogiVCT(1 "^ )ta 



(B14) 



(B15) 



This result deviates from zero for t > t c (r) = rc 2 ^ 2 y !°^ — consistent with our expected t c for r ~ r c . For r > r c the 
participation ratio stays small for all £. 
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Appendix C: Participation ratio for the general traveling wave 

As before, the participation fraction is given by an integral over clones seeded in the past. As opposed to the fully 
epistatic case, however, we now consider a finite velocity, which limits the lifetime of clones. A clone seeded at time 
r in the past at fitness A above the mean has a size 

rij = e ( ^- r ~ )r ^-^ , (CI) 

where r = r + (E). New clones get see ded with rate 7VY, and to calculate (Y), we need to evaluate the integrals 
C r (z,v) and B r (z,v) as defined in Eqs. (A8) and (A10). In contrast to the case discussed above with v = 0, a finite 
v causes clones to have a limited life-time, even if they are initially very fit. 

The integral C r (z,v) over F and r has one contribution from young (small) clones with average fitness, in which 
case zrij is small. This young contribution evaluates to 

CV(z,v)=zr J™ dr J ^Le-^+( F -^ T -^ « ™ . (C2) 

Here we have evaluated only the contribution from young clones and neglected the fact that there is a growing 

quadratic term in the exponent. The latter is due to old clones; we will evaluate this term next. If v < 1, the 

i 2 

integrand starts growing again with r and until it is cutoff at \ogz~ l — Or — For a given r, this cut-off is at 
0* = t -1 \ogz~ 1 + vt/2, and the integrand has a peak at 0* if (1 — v)r > r. In this case, we expand the integrand 
around 0*, i.e., F* = 0* + f, and find for the old clones 

C° r (z,v)^r -=e 2 —e* dd (l-e e ) = -r ~ 7 ^e 2 T( ), (C3) 

JO V27T J V27T JO V27TT T 

The remainder can be evaluated by assuming that the non-exponential parts are slowly varying. The integrand peaks 
roughly at lo ^f 2 = | or r* = \j2v~ 1 log z~ x . This translates into F* = y/2v log z~ x + f, and the curvature is 
2F* logz- 1 //* 3 = vF*/t* « v 2 . Hence 



C ' M "~ 7W r H 1 + 7^')- <C4) 

B r (z,v) can be evaluated by differentiating C twice, which yields 



r7 v-2 p -ry/2v logz- 1 -^ ( ~ \ 

*<*• •> - r 2 - " (1 + 7m^ ] ■ (C5) 



Additive fitness functions 

In the additive case with v = 1, the cut-off C r (z,v) is dominated by the young clones for any recombination rate. 
Its second derivative, however, is dominated by the contribution from old clones if r < (y/2 — l)\/2 log N . In this 
regime, we find 



f°° r p~ r V' 2 lo s z- 1 - r 2 

(Y(r, 1)) ~ N dz T(l - )e~^ ~ e~ ^^"^ . (C6) 

Jo v / 21ogz- i v / 21ogz- i 

In the last step, we have dropped all preexponential factors and reinstantiated a. This correction approximately 
accounts for the fact that only a fraction a of the clones that are seeded are successful. 

For larger r, even the B r (z,h) term is dominated by young clones and, Y ~ A^ _1 . This behavior holds while 
the recombination rate is larger than l/y/2 log N. For smaller r, the velocity starts to deviate from 1. In this case, 
C r (z, h) is dominated by old clones. For large enough populations, the powers of z vary much more quickly than all 
other terms, which we denote collectively by A(z), and we can approximate the integral as 

(Y(r,h)) = - [°° dzz^AizW - V{9 \ . + r) )/AMr(-^) 
Jo 

_ i r(2-jgg) g*+r v(0-+r) (C7) 

v y( v{ - e * +r "> ) o* ^ e* ' ~ u 7 ' 
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where in the last step we reintroduced a by replacing v with 7 = v/a 2 . Along the way, we have assumed r <C 
ay/2 log Na, which is consistent with the assumption of small r made above. 

Epistatic fitness function 

In cases with heritabilities < h 2 < 1, we generally have v < 1 and can be in a regime where C r {z,v) « z, while 
B r (z,v) is dominated by old clones. In this case, we have 



(F(r,/i)) « JV / dzz v ~ x e~ Nz ~ jyi-7 e - W^iog^-^ _ (cg) 

Jo v 2v log z 1 

This starts to be of order iV -1 as soon as f < a(y/2 — 7)^/2 log Ncr. This is similar to the condition identified in 



( |Neher and Shraiman , 2009) based on the factorization of the mean field solution. Note, however, that this expression 



does not hold near the v = line, since it is derived assuming a steady traveling wave. 

Appendix D: Velocity in the low recombination limit 

With sufficiently frequent recombination, the mean fitness in the population increases with a rate v given by the 
variance in additive fitness among recombinant offspring. At lower recombination rates, however, the fitness variance 
in the population gets reduced below that of the distribution of recom binants. With reduced the additive variance, v 
decreases. This effect has been studied in (Rouzine and Coffin, [2005 ) for a model with only additive fitness, and we 



reproduce this argument for our simplified recombination model. 

We will again work in units where the variance of the fitness distribution of recombinant genotypes is 1 and measure 
fitness a relative to the instantaneous mean fitness in the population. The invariant fitness distribution P(a) in the 
comoving frame a — A — vt is governed by 

- vd a P(a) = (a- r)P(a) + (Dl) 

V 27T 



where the last term accounts for the injection of recombinant offspring. This equation has the solution 

^/ X 1 ) 2 r C a' 2 , (a'-r) 2 

P(a)=rv~ 1 e *r~ —^e~^ + ^^. (D2) 

J a V27T 

Note that this solution becomes negative for a > a c , and only the part a < a c is of interest. The zero crossing a c 
marks the position of the most fit individuals in the population, and its value will depend on the population size. 
To determine the velocity of a finite population, Rouzine and Coffin ( 2005| ) compared the rate at which new 



genotypes ahead of a c (records) are created to the speed at which the bulk of the population moves towards higher 
fitness. 

Within the model, recombinant genotypes are generated with rate rN and drawn from a Gaussian distribution with 
unit variance. Hence genotypes with a > a c are produced at rate 



rN 



f°° da _^ rN _«A 

/ ^=e 2 ~ e 2 . (D3) 

Ja c V27T a c V27T 

Such a genotype seeded a c ahead of the mean has a chance a c a to establish and, if established, advances the nose by 
— . We therefore find for the speed of the nose 



arN 

e 2 (D4) 



To solve for v nose = v and a c , we need an additional relation which can be obtained from the condition that P(a) is 



normalized. In the limit of small r, we can approximate Eq. (D2) 
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The approximation is valid for a c (l — v) ^> r; otherwise the population distribution is a simple Gaussian, and we 
find v « 1. Note that this condition is the same as the one we encountered above in the context of whether C r (z,v) 
is dominated by young or old clones. For a c (l — v) ^> r, C r (z,v) and equivalently the normalization of P{a) are 
dominated by the old clones. The normalization condition on P(a) now requires 



(i-v)g 



which yields 



2v 



(1 - v)a c - r 

(1 - v)a c 



(1-v) 



log 



1 , 



2v ^2(1 - v) 
log^ 



Solving Eq. (D4) for a c and equating the result with Eq. (D7) results in 



1 



2v ^2(1 - v) 
log-* 



2 log 



arN 



(D6) 



(D7) 



(D8) 



After rearranging and replacing v by 7 = v/o , we obtain the expression given in Eq. (19). 



Appendix E: Mean field solution 

Above, we discussed the large r solution to the equation 



T ^2 



P(A,E) = (F-(F)-r)P(A,E) + -==e M p(E) , (El) 

where we have introduced the symbol p(E) for the distribution of epistatic fitness. At large r, this model admits a 
factorized solution P(A,E) = $(A,t)uj(E) with 

A/27r(T^ r + C - E 

The epistatic part needs both to be normalized, and C needs to equal the mean of the distribution w{E). Those two 
are linked: when one is true, so is the other. Hence 



r P( E ) f (E-C\ n _ 1 f ^^^E-C ^ (E-C s 

(E3) 



7 n=0 ^ ' ' 7 n=0 



r J r + G— r r 



Hence the constant C, adjusted to achieve normalization, equals the population mean of E. 
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